smdi - An R package to perform routine structural missing data investigations in real-world data
Division of Pharmacoepidemiology and Pharmacoeconomics
Brigham and Women’s Hospital
Harvard Medical School
June 6, 2023
Disclosures
Administrative insurance claims databases are increasingly linked to electronic health records (EHR) to improve confounding adjustment for variables that cannot be measured in administrative claims.
Examples:
These covariates are often just partially observed for various reasons:
Missing data in confounding factors are frequent
Two common missing data taxonomies
Unresolved challenges for causal inference:
Objectives of the Sentinel Innovation Center Causal Inference Workstream
R package to implement framework and missing data investigations on a routine basis
Causal diagrams/M-graphs1,2 provide a more natural way to understand the assumptions regarding missing (confounder) data for a given research question.
Legend: a) Missing completely at random (MCAR), b) Missing at random (MAR), c) Missing not at random 1 (MNAR unmeasured), d) Missing not at random 2 (MNAR value)
Notation: E = Exposure, Y = Outcome, C1 = Fully observed confounders, C = Confounder of interest, C_obs = Observed portion of C, M = Missingness indicator
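The four mechanisms in the legend can be made concrete with a small simulation. This is a minimal sketch in base R, not part of smdi: the variable names, sample size, and all coefficients are arbitrary assumptions chosen for illustration, and `u` stands in for an unmeasured variable.

```r
set.seed(42)
n      <- 5000
c1     <- rnorm(n)                 # C1: fully observed confounder
u      <- rnorm(n)                 # unmeasured variable (assumed, never in the data)
c_full <- 0.5 * c1 + rnorm(n)      # C: confounder of interest before missingness

# M = 1 means C is missing; coefficients are illustration values only
m_mcar   <- rbinom(n, 1, 0.3)                        # a) MCAR: independent of everything
m_mar    <- rbinom(n, 1, plogis(-1 + 1.5 * c1))      # b) MAR: depends on observed C1
m_mnar_u <- rbinom(n, 1, plogis(-1 + 1.5 * u))       # c) MNAR unmeasured: depends on u
m_mnar_v <- rbinom(n, 1, plogis(-1 + 1.5 * c_full))  # d) MNAR value: depends on C itself

# C_obs: observed portion of C, shown here for the MAR scenario
c_obs <- ifelse(m_mar == 1, NA, c_full)

# Difference in mean C1 between patients with and without missing C
mean_diff <- function(x, m) mean(x[m == 1]) - mean(x[m == 0])
diffs <- c(
  mcar   = mean_diff(c1, m_mcar),
  mar    = mean_diff(c1, m_mar),
  mnar_u = mean_diff(c1, m_mnar_u),
  mnar_v = mean_diff(c1, m_mnar_v)
)
print(round(diffs, 2))
```

In this setup the observed covariate C1 differs clearly by missingness status under MAR, but not under MCAR or MNAR (unmeasured). Under MNAR (value) it can also differ, because C was simulated to depend on C1.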
Observations
Plasmode simulation results averaged across all scenarios and simulated datasets.
The observed diagnostic pattern of a specific study will give insights into the likelihood of underlying missingness structures
The smdi package aims to streamline these structural missing data diagnostics (and more)!
… let’s walk through some examples and functionalities of smdi
smdi bundled datasets
The smdi package comes with two example simulated datasets:
smdi_data (includes some partially observed covariates)
smdi_data_complete (complete dataset if you prefer to introduce NA yourself)
Rows: 2,500
Columns: 14
$ exposure <int> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,…
$ age_num <dbl> 35.24, 51.18, 88.17, 50.79, 40.52, 64.57, 73.58, 42.38, …
$ female_cat <fct> 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,…
$ smoking_cat <fct> 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,…
$ physical_cat <fct> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,…
$ alk_cat <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ histology_cat <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,…
$ ses_cat <fct> 2_middle, 3_high, 2_middle, 2_middle, 2_middle, 2_middle…
$ copd_cat <fct> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,…
$ eventtime <dbl> 5.000000000, 4.754220474, 0.253391563, 5.000000000, 5.00…
$ status <int> 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
$ ecog_cat <fct> 1, NA, 0, 1, NA, 0, 1, 0, 1, NA, 1, NA, NA, 1, 1, 0, 1, …
$ egfr_cat <fct> NA, 0, 1, NA, 1, NA, NA, 0, NA, 0, 1, NA, 0, NA, NA, 0, …
$ pdl1_num <dbl> 45.03, NA, 41.74, 45.51, 31.28, NA, 47.28, 37.28, 46.47,…
smdi functions automatically include all variables with at least one missing value (default); specific variables can be selected with the covar parameter
Overall
Stratified by another variable (stratum-specific sample size is the denominator)
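The two summaries (overall and stratified) amount to simple proportions. Here is a minimal base R sketch on a hypothetical six-row data frame; it illustrates the arithmetic only and is not the smdi implementation.

```r
# Toy data (hypothetical values) with two partially observed covariates
df <- data.frame(
  exposure = c(1, 1, 0, 0, 1, 0),
  ecog_cat = c(1, NA, 0, NA, 1, 0),
  pdl1_num = c(45.0, NA, 41.7, 45.5, NA, NA)
)

# Default behaviour described above: keep every column with at least one NA
covars <- names(df)[colSums(is.na(df)) > 0]

# Overall percentage missing per covariate
overall <- colMeans(is.na(df[covars])) * 100

# Percentage missing within each exposure stratum
# (the stratum-specific sample size is the denominator)
by_exposure <- sapply(
  split(df[covars], df$exposure),
  function(d) colMeans(is.na(d)) * 100
)

print(round(overall, 1))
print(round(by_exposure, 1))
```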
smdi uses a re-export of the naniar3 gg_miss_upset and mice4 md.pattern functions to investigate potentially underlying missing data patterns
Note
Monotone and non-monotone (or general). A missing data pattern is said to be monotone if the variables \(Y_j\) can be ordered such that if \(Y_j\) is missing then all variables \(Y_k\) with \(k>j\) are also missing. This occurs, for example, in longitudinal studies with drop-out. If the pattern is not monotone, it is called non-monotone or general.4
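The definition in the note can be checked mechanically. The helper below, `is_monotone`, is a hypothetical function written for this illustration (it is not part of smdi, naniar, or mice): it orders variables from least to most missing and then verifies that, within every row, once a variable is missing all later ones are missing too.

```r
is_monotone <- function(data) {
  miss <- is.na(data)
  # Candidate ordering Y_1, ..., Y_p: least missing first
  miss <- miss[, order(colSums(miss)), drop = FALSE]
  # Each row must look like FALSE, ..., FALSE, TRUE, ..., TRUE
  all(apply(miss, 1, function(r) !is.unsorted(r)))
}

# Longitudinal drop-out: a patient who misses a visit misses all later visits
dropout <- data.frame(
  visit1 = c(1, 2, 3, 4),
  visit2 = c(1, 2, NA, 4),
  visit3 = c(1, NA, NA, 4)
)

# General pattern: the two variables are missing for different patients
general <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 3))

print(is_monotone(dropout))
print(is_monotone(general))
```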
smdi uses a re-export of the naniar3 gg_miss_upset function to investigate potentially underlying missing data patterns
smdi_asmd
Group 1 diagnostics: Differences in covariate distributions
# A tibble: 3 × 4
covariate asmd_median asmd_min asmd_max
* <chr> <chr> <chr> <chr>
1 ecog_cat 0.029 0.003 0.071
2 egfr_cat 0.243 0.010 0.485
3 pdl1_num 0.062 0.019 0.338
The output is an asmd object that contains much more information than what is shown by the S3 generic print method, e.g. a complete ‘Table 1’ that displays the covariate distributions of patients with and without an observed covariate value:
Stratified by pdl1_num_NA
| | 0 | 1 | p | test | SMD |
|---|---|---|---|---|---|
| n | 1983 | 517 | | | |
| exposure (mean (SD)) | 0.43 (0.50) | 0.27 (0.45) | <0.001 | | 0.338 |
| age_num (mean (SD)) | 60.60 (14.04) | 62.07 (14.47) | 0.036 | | 0.103 |
| female_cat = 1 (%) | 717 (36.2) | 205 (39.7) | 0.157 | | 0.072 |
| smoking_cat = 1 (%) | 990 (49.9) | 263 (50.9) | 0.739 | | 0.019 |
| physical_cat = 1 (%) | 707 (35.7) | 175 (33.8) | 0.476 | | 0.038 |
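The SMD column in the table above can be approximated by hand from the printed means and standard deviations. This is a minimal sketch assuming the common pooled-standard-deviation definition of the standardized mean difference; `smd_from_summary` is a hypothetical helper, and the inputs are the rounded values from the table, so small deviations from the reported SMDs are expected.

```r
# Absolute standardized mean difference from group-level summary statistics
smd_from_summary <- function(mean_obs, sd_obs, mean_mis, sd_mis) {
  abs(mean_obs - mean_mis) / sqrt((sd_obs^2 + sd_mis^2) / 2)
}

# Rounded values from the pdl1_num_NA table (column 0 = observed, 1 = missing)
smd_exposure <- smd_from_summary(0.43, 0.50, 0.27, 0.45)
smd_age      <- smd_from_summary(60.60, 14.04, 62.07, 14.47)

# smdi_asmd summarizes such values per partially observed covariate
# as median, minimum, and maximum across all other covariates
asmd <- c(exposure = smd_exposure, age_num = smd_age)
asmd_summary <- c(median = median(asmd), min = min(asmd), max = max(asmd))

print(round(asmd, 3))
print(round(asmd_summary, 3))
```

Both values land close to the reported 0.338 and 0.103.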
smdi_asmd
Group 1 diagnostics: Differences in covariate distributions
Investigators can also inspect standardized mean differences5 by covariate in detail:
smdi_hotelling
Group 1 diagnostics: Differences in covariate distributions
Hotelling’s6 multivariate t-test examines differences in covariate distributions between patients with and without an observed value of the covariate. Rejection of \(H_0\) would indicate significant differences between these patient strata.
smdi_little
Group 1 diagnostics: Differences in covariate distributions
Little’s7 chi-square test takes into account possible patterns of missingness across all variables in the dataset. A high test statistic and low p-value (rejection of \(H_0\)) would indicate that the global missing data generating mechanism is not completely at random.
smdi_rf
Group 2 diagnostics: Ability to predict missingness
The smdi_rf function fits a random forest model to assess the ability to predict missingness for the specified covariate(s).8
# A tibble: 3 × 2
covariate rf_auc
* <chr> <chr>
1 ecog_cat 0.510
2 egfr_cat 0.629
3 pdl1_num 0.516
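The idea behind the Group 2 diagnostic is to model the missingness indicator and report the discrimination (AUC) of that model. The sketch below illustrates the principle in base R only. It swaps in a logistic regression for the random forest, evaluates the AUC in-sample rather than on held-out data, and uses simulated data with an assumed age-dependent missingness mechanism. None of this reflects the internals of smdi_rf.

```r
set.seed(1)
n        <- 2000
age      <- rnorm(n, 60, 14)
exposure <- rbinom(n, 1, 0.4)

# Assumed mechanism: the probability of a missing value increases with age
m <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 60)))

# Stand-in prediction model (logistic regression instead of a random forest)
fit  <- glm(m ~ age + exposure, family = binomial())
pred <- fitted(fit)

# AUC via the rank-sum (Mann-Whitney) formulation
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_value <- auc(pred, m)
print(round(auc_value, 3))
```

An AUC near 0.5 suggests that missingness cannot be predicted from the observed covariates. Values clearly above 0.5, as for egfr_cat in the output above, indicate that observed covariates carry information about missingness.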
Parallelization
Depending on the amount of data (sample size × number of covariates), the computation can take several minutes. To speed it up, investigators can parallelize the computation using the n_cores parameter (UNIX only).
smdi_rf
The resulting smdi_rf object makes it possible to inspect the importance of each predictor, which can give important hints about the potentially underlying missing data generating mechanism.
smdi_outcome
The Group 3 diagnostic assesses the association between the missingness indicator of the partially observed covariate and the outcome under study (is the missingness differential?).
outcome <- smdi_outcome(
  data = smdi_data,
  model = "cox",
  form_lhs = "Surv(eventtime, status)",
  exponentiated = FALSE
)
outcome
# A tibble: 3 × 3
covariate estimate_crude estimate_adjusted
<chr> <glue> <glue>
1 ecog_cat -0.06 (95% CI -0.16, 0.03) -0.06 (95% CI -0.16, 0.03)
2 egfr_cat 0.06 (95% CI -0.03, 0.15) -0.01 (95% CI -0.10, 0.09)
3 pdl1_num 0.12 (95% CI 0.01, 0.23) 0.11 (95% CI -0.00, 0.22)
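The crude and adjusted columns can be read as the coefficient of the missingness indicator in an outcome model, without and with the other covariates. The sketch below mimics that contrast in base R. It uses a linear model instead of the Cox model above, and simulated data in which missingness depends only on age and the outcome does not depend on missingness at all. All names and coefficients are assumptions for illustration.

```r
set.seed(7)
n   <- 3000
age <- rnorm(n, 60, 14)

# Missingness indicator depends on age only (assumed mechanism)
m <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 60)))

# Outcome depends on age, but not on the missingness indicator
y <- 0.03 * age + rnorm(n)

crude    <- lm(y ~ m)         # missingness indicator only
adjusted <- lm(y ~ m + age)   # additionally adjusted for the observed covariate

est <- c(
  crude    = coef(crude)[["m"]],
  adjusted = coef(adjusted)[["m"]]
)
print(round(est, 2))
```

Here the crude association is driven entirely by age and moves towards zero after adjustment, which is the same qualitative pattern as egfr_cat in the output above.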
Supported regression types
Currently, the main types of outcome regression are supported: logistic (glm), linear (lm), and Cox proportional hazards (survival) models. The model type and the outcome are specified using the model and form_lhs parameters.
smdi_diagnose
One function to rule them all: smdi_diagnose
Let’s take a look at a minimal example
diagnostics <- smdi_diagnose(
  data = smdi_data,
  model = "cox",
  form_lhs = "Surv(eventtime, status)",
  n_cores = 3
)
diagnostics
smdi summary table:
# A tibble: 3 × 6
covariate asmd_median_min_max hotteling_p rf_auc estimate_crude
<chr> <chr> <chr> <chr> <glue>
1 ecog_cat 0.029 (0.003, 0.071) 0.783 0.510 -0.06 (95% CI -0.16, 0.03)
2 egfr_cat 0.243 (0.010, 0.485) <.001 0.629 0.06 (95% CI -0.03, 0.15)
3 pdl1_num 0.062 (0.019, 0.338) <.001 0.516 0.12 (95% CI 0.01, 0.23)
# ℹ 1 more variable: estimate_adjusted <glue>
p_little: <.001
smdi_diagnose
The output is a list that contains all three group diagnostics validated in the plasmode simulation study…
Covariate-specific table:
# A tibble: 3 × 6
covariate asmd_median_min_max hotteling_p rf_auc estimate_crude
<chr> <chr> <chr> <chr> <glue>
1 ecog_cat 0.029 (0.003, 0.071) 0.783 0.510 -0.06 (95% CI -0.16, 0.03)
2 egfr_cat 0.243 (0.010, 0.485) <.001 0.629 0.06 (95% CI -0.03, 0.15)
3 pdl1_num 0.062 (0.019, 0.338) <.001 0.516 0.12 (95% CI 0.01, 0.23)
# ℹ 1 more variable: estimate_adjusted <glue>
Global Little’s test p-value:
smdi_style_gt
smdi_style_gt takes an object of class smdi (i.e., the output of smdi_diagnose) and formats it into a publication-ready gt table:
| Covariate | ASMD (min/max)¹ | p Hotelling¹ | AUC² | beta crude (95% CI)³ | beta (95% CI)³ |
|---|---|---|---|---|---|
| ecog_cat | 0.029 (0.003, 0.071) | 0.783 | 0.510 | -0.06 (95% CI -0.16, 0.03) | -0.06 (95% CI -0.16, 0.03) |
| egfr_cat | 0.243 (0.010, 0.485) | <.001 | 0.629 | 0.06 (95% CI -0.03, 0.15) | -0.01 (95% CI -0.10, 0.09) |
| pdl1_num | 0.062 (0.019, 0.338) | <.001 | 0.516 | 0.12 (95% CI 0.01, 0.23) | 0.11 (95% CI -0.00, 0.22) |
p little: <.001
Abbreviations: ASMD = Median absolute standardized mean difference across all covariates, AUC = Area under the curve, beta = beta coefficient, CI = Confidence interval, max = Maximum, min = Minimum
¹ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate
² Group 2 diagnostic: Ability to predict missingness
³ Group 3 diagnostic: Assessment if missingness is associated with the outcome (crude, adjusted)
smdi_style_gt
Since smdi_style_gt transforms the smdi object into an object of class gt_tbl, an investigator can also take advantage of all the gt package perks, such as exporting the table in different formats (.docx, .rtf, .pdf, etc.):
Vignettes/tutorials: janickweberpals.gitlab-pages.partners.org/smdi
Presentation quarto code: gitlab-scm.partners.org/drugepi/NESS2023
Presentation slides: drugepi.gitlab-pages.partners.org/NESS2023/ness2023.html
Mass General Brigham
Duke
Kaiser WA
Harvard Pilgrim/SOC
FDA
References cited in this presentation